Hardware-based data prefetching based on loop-unrolled instructions

ABSTRACT

Prefetching data by determining that a first set of instructions that is processed by a computer processor indicates that a second set of instructions includes multiple iteration groups, where each of the iteration groups includes one or more loop-unrolled instructions, monitoring the second set of instructions as the second set of instructions is processed by the computer processor after the first set of instructions is processed by the computer processor, mapping a corresponding one of the loop-unrolled instructions in each of the iteration groups of the second set of instructions to a stride-tracking record that is shared by the corresponding loop-unrolled instructions, and prefetching data into a cache memory of the computer processor based on the stride-tracking record.

BACKGROUND

Data prefetching is a technique often employed by computer processors to improve execution performance by retrieving data from slow-access storage, typically main memory, to fast-access local storage, typically cache memory, before the data are actually needed for processing. Data prefetching strategies typically leverage situations in which sequential data items are stored contiguously in statically-allocated memory, such as is typically the case with array-based data that are to be retrieved and processed in the order in which they are stored. For example, when the following programming loop is used to access a data array:

    for (int i=0; i<1024; i++) {
        array1[i] = array1[i] + 1;
    }

the i-th element of the array “array1” is accessed at each iteration. Thus, array elements that are going to be accessed in future iterations may be prefetched before the future iterations occur.

In hardware-based prefetching, a computer processor includes a mechanism that monitors the stream of instructions of a program during its execution, recognizes elements that the program might access in the future based on this stream, and prefetches such elements into the processor's cache. In the above programming loop example, a type of hardware-based prefetching known as “strided prefetching” may be used to identify instructions for which data are accessed at a computer memory address, determine that the same instruction at the same instruction address is executed multiple times, where each time data are accessed at a different computer memory address, and determine the number of intermediate addresses from one such computer memory address to the next, known as a “stride.” Once a consistent stride pattern is established for such an instruction at a given instruction address, data may be prefetched from computer memory addresses that are multiple strides ahead of the computer memory address most recently accessed by the instruction. In order to monitor such instructions, hardware-based strided prefetching mechanisms typically maintain a stride-tracking record in a history table of such records for each such instruction, the stride-tracking record indicating the address of the instruction and tracking the stride between the computer memory addresses accessed each time the same instruction is executed. Establishing a consistent stride typically takes three iterations of a prefetching candidate instruction, where its stride is determined during the second iteration and is verified during the third iteration. Thus, in the above example, if a consistent stride is verified when array1[2] is fetched from computer memory, prefetching can begin with the computer memory location at the next stride.
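By way of non-limiting illustration, the conventional stride-tracking behavior described above may be sketched in C as follows. The record layout, the field names, and the function names are assumptions introduced only for this sketch and do not describe any particular processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stride-tracking record; the field names are assumptions. */
typedef struct {
    uint64_t instr_addr;  /* address of the tracked instruction */
    uint64_t last_addr;   /* data address of its most recent access */
    int64_t  stride;      /* most recently observed stride, in bytes */
    unsigned executions;  /* executions seen so far, saturating at 2 */
    bool     verified;    /* set once the stride has repeated */
} stride_record;

/* Record one execution; returns true once the stride is verified. */
static bool stride_record_observe(stride_record *r, uint64_t data_addr)
{
    assert(r != NULL);
    if (r->executions >= 1) {
        int64_t observed = (int64_t)(data_addr - r->last_addr);
        /* The second execution determines the stride; later ones verify it. */
        r->verified = (r->executions >= 2) && (observed == r->stride);
        r->stride = observed;
    }
    r->last_addr = data_addr;
    if (r->executions < 2) {
        r->executions++;
    }
    return r->verified;
}

/* Address that lies strides_ahead strides past the most recent access. */
static uint64_t stride_record_target(const stride_record *r,
                                     unsigned strides_ahead)
{
    assert(r != NULL);
    return r->last_addr + (uint64_t)(r->stride * (int64_t)strides_ahead);
}
```

In this sketch, three calls with data addresses that are 4 bytes apart leave the record verified on the third call, which corresponds to the three iterations noted above.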

Unfortunately, hardware-based strided prefetching is complicated by optimizing compilers that attempt to improve a program's execution performance by employing “loop unrolling” techniques, whereby loop instructions that would otherwise be performed in repeated iterations are transformed into a repeated sequence of instructions that require fewer iterations. Thus, in the above programming loop example, the loop may be transformed into separate instructions in a loop-unrolled format equivalent to the following instructions:

    for (int i=0; i<1024; i+=5) {
        array1[i] = array1[i] + 1;
        array1[i+1] = array1[i+1] + 1;
        array1[i+2] = array1[i+2] + 1;
        array1[i+3] = array1[i+3] + 1;
        array1[i+4] = array1[i+4] + 1;
    }

If hardware-based strided prefetching is then applied in the manner described above, since each of the array access instructions above will be transformed into five corresponding instructions requiring memory access, each having a different instruction address, five separate stride-tracking records will be required to track the strides between the computer memory addresses accessed by their corresponding instructions. Where a computer processor is configured with a limited number of stride-tracking records, this can result in thrashing of the history table, aliasing when mapping instruction addresses to stride-tracking records, or contention, any of which may reduce the effectiveness of the hardware-based prefetching mechanism. Also, given that in the loop-unrolled example above a consistent stride can only be verified for the instruction array1[i] = array1[i] + 1 during its third iteration, when fetching array1[10], prefetching might not even occur when short loops are loop-unrolled.
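The arithmetic behind the array1[10] observation may be sketched as follows. The function names and parameters are hypothetical and serve only to make the relationship between the unroll factor and the third execution explicit.

```c
#include <assert.h>
#include <stdbool.h>

/* Element index touched when unrolled copy number `copy` (0-based) of the
 * loop body executes for the nth time (1-based), given the unroll factor. */
static unsigned element_at_execution(unsigned unroll, unsigned copy,
                                     unsigned nth)
{
    assert(unroll > 0 && copy < unroll && nth > 0);
    return copy + (nth - 1) * unroll;
}

/* True if a loop over `length` elements runs long enough for the first
 * unrolled copy to reach the third execution that verifies its stride. */
static bool stride_can_be_verified(unsigned length, unsigned unroll)
{
    return length > element_at_execution(unroll, 0, 3);
}
```

With an unroll factor of 5, element_at_execution(5, 0, 3) yields 10, whereas the original loop (an unroll factor of 1) reaches its third execution at element 2. Under the same assumptions, a loop over only 8 elements that is unrolled by 5 never reaches the verifying execution.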

SUMMARY

In one aspect of the invention a method is provided for prefetching data, the method including determining that a first set of instructions that is processed by a computer processor indicates that a second set of instructions includes multiple iteration groups, where each of the iteration groups includes one or more loop-unrolled instructions, monitoring the second set of instructions as the second set of instructions is processed by the computer processor after the first set of instructions is processed by the computer processor, mapping a corresponding one of the loop-unrolled instructions in each of the iteration groups of the second set of instructions to a stride-tracking record that is shared by the corresponding loop-unrolled instructions, and prefetching data into a cache memory of the computer processor based on the stride-tracking record.

In other aspects of the invention systems and computer program products embodying the invention are provided.

BRIEF DESCRIPTION OF THE DRAWINGS

Aspects of the invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the appended drawings in which:

FIG. 1 is a simplified conceptual illustration of a system for prefetching data, constructed and operative in accordance with an embodiment of the invention; and

FIG. 2 is a simplified flowchart illustration of an exemplary method of operation of the system of FIG. 1, operative in accordance with an embodiment of the invention.

DETAILED DESCRIPTION

Reference is now made to FIG. 1, which is a simplified conceptual illustration of a system for prefetching data, constructed and operative in accordance with an embodiment of the invention. In the system of FIG. 1, a computer processor 100 is shown including a main memory 102, a cache memory 104, and a prefetcher 106 preferably assembled therewith. Prefetcher 106 is configured to monitor instructions as they are processed by computer processor 100, such as a set of instructions 108, to identify instructions that are prefetching candidates, such as memory-to-register “load” instructions for which data are accessed at a computer memory address and loaded into a register of computer processor 100, such as a register 110, as well as “store” instructions that require cache lines to be copied from computer memory into cache memory prior to modifying and storing the cache line back to computer memory. Prefetcher 106 maintains a history table 112 that includes a number of stride-tracking records, such as stride-tracking records 114A, 114B, and 114C, for tracking such prefetching candidate instructions in accordance with conventional hardware-based strided prefetching techniques, except as is otherwise described herein.

In accordance with an embodiment of the invention, prefetcher 106 is configured to determine that a first set of instructions 116 that is processed by the computer processor 100 indicates that a second set of instructions 118 includes two or more iteration groups 120, where each of the iteration groups 120 includes one or more loop-unrolled instructions. Preferably, the second set of instructions 118 immediately follows the first set of instructions 116, such as where both the first set of instructions 116 and the second set of instructions 118 are included in a parent set of instructions, such as the set of instructions 108. The loop-unrolled instructions are typically associated with a loop of instructions, where each of the iteration groups 120 corresponds to a different iteration of the loop of instructions.

The first set of instructions 116 preferably includes one or more instructions that, taken together, provide the following information regarding the second set of instructions 118:

a) a count of the iteration groups 120 in the second set of instructions 118;

b) a count of the loop-unrolled instructions in any of the iteration groups 120, where the count of the loop-unrolled instructions is the same for each of the iteration groups 120; and

c) a count of instruction bytes in any of the iteration groups 120, where the count of instruction bytes is the same for each of the iteration groups 120.

The first set of instructions 116 is preferably configured by an optimizing compiler to include the above information regarding the second set of instructions 118 based on loop unrolling that the optimizing compiler applies to a loop of instructions that are compiled by the optimizing compiler, where the resulting loop-unrolled instructions are included in the second set of instructions 118.
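The three counts listed above may be pictured as a small descriptor that the prefetcher holds after decoding the first set of instructions 116. The following C sketch is illustrative only: the type name, the field names, and the plausibility rule that each instruction occupies at least one byte are assumptions and are not an encoding defined herein.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded form of the information provided by the first set
 * of instructions; the names are assumptions made for illustration. */
typedef struct {
    uint32_t group_count;       /* a) count of iteration groups */
    uint32_t instrs_per_group;  /* b) loop-unrolled instructions per group */
    uint32_t bytes_per_group;   /* c) instruction bytes per group */
} unroll_hint;

/* Basic sanity check: two or more groups, one or more instructions per
 * group, and (assumed) at least one byte per instruction. */
static bool unroll_hint_is_plausible(const unroll_hint *h)
{
    assert(h != NULL);
    return h->group_count >= 2 &&
           h->instrs_per_group >= 1 &&
           h->bytes_per_group >= h->instrs_per_group;
}

/* Total instruction bytes spanned by all of the iteration groups. */
static uint64_t unroll_hint_region_bytes(const unroll_hint *h)
{
    assert(h != NULL);
    return (uint64_t)h->group_count * h->bytes_per_group;
}
```

For instance, a descriptor holding 4 groups, 11 instructions per group, and 44 bytes per group spans 176 instruction bytes.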

Prefetcher 106 is configured to monitor the second set of instructions 118 as the second set of instructions 118 is processed by the computer processor 100 after the first set of instructions 116 is processed by the computer processor 100. Using the information provided by the first set of instructions 116 regarding the second set of instructions 118, prefetcher 106 maps a corresponding one of the loop-unrolled instructions in each of the iteration groups 120 of the second set of instructions 118 to a corresponding one of stride-tracking records 114A, 114B, and 114C that is shared by each of the corresponding loop-unrolled instructions, provided that prefetcher 106 has identified the loop-unrolled instruction being mapped as a prefetching candidate instruction. Thus, for example, where the first set of instructions 116 indicates that the second set of instructions 118 includes three iteration groups 120, and each iteration group 120 includes three prefetching candidate instructions, prefetcher 106 maps the first prefetching candidate instruction in each of the three iteration groups 120 to stride-tracking record 114A, maps the second prefetching candidate instruction in each of the three iteration groups 120 to stride-tracking record 114B, and maps the third prefetching candidate instruction in each of the three iteration groups 120 to stride-tracking record 114C. Prefetcher 106 preferably maps each prefetching candidate instruction in each of the iteration groups 120 in the second set of instructions 118 to a stride-tracking record by applying a predefined mapping function to a combination of:

a) the instruction address of the corresponding loop-unrolled instruction being mapped, where each of the corresponding loop-unrolled instructions has a different instruction address;

b) the index number of the iteration group 120 of the corresponding loop-unrolled instruction being mapped, where the iteration groups 120 form a sequence within the second set of instructions 118, and where the index number indicates the position of the iteration group 120 (of the corresponding loop-unrolled instruction being mapped) within the sequence of iteration groups 120; and

c) the count of the instruction bytes within the iteration group 120 of the corresponding loop-unrolled instruction being mapped.

Prefetcher 106 then prefetches data in accordance with conventional strided prefetching techniques based on any of the stride-tracking records that meets one or more predefined eligibility criteria for prefetching. For example, when a given stride-tracking record indicates that a consistent stride is encountered a predefined number of consecutive times for corresponding prefetching candidate instructions (in multiple iteration groups 120) that are mapped to the given stride-tracking record, prefetcher 106 prefetches data into cache memory 104 from computer memory addresses of main memory 102 that are a predefined number of strides ahead of the computer memory address most recently accessed by one of the corresponding prefetching candidate instructions.
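One possible reading of the eligibility criterion described above is a counter of consecutive occurrences of the same stride that is compared against a predefined threshold. The C sketch below assumes that reading. The names, the requirement of a non-zero stride, and the choice of threshold and distance are hypothetical parameters rather than values specified herein.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-record state used to judge prefetch eligibility. */
typedef struct {
    uint64_t last_addr;  /* most recently accessed data address */
    int64_t  stride;     /* stride currently being counted */
    uint32_t repeats;    /* consecutive times that stride was encountered */
    bool     primed;     /* false until a first access has been recorded */
} tracked_stride;

/* Note one access by any instruction mapped to this record. */
static void tracked_stride_note(tracked_stride *t, uint64_t data_addr)
{
    assert(t != NULL);
    if (t->primed) {
        int64_t observed = (int64_t)(data_addr - t->last_addr);
        if (observed == t->stride) {
            if (t->repeats < UINT32_MAX) {
                t->repeats++;
            }
        } else {
            t->stride = observed;  /* a new stride restarts the count */
            t->repeats = 1;
        }
    }
    t->last_addr = data_addr;
    t->primed = true;
}

/* Eligible once the stride repeated `required` consecutive times. */
static bool tracked_stride_eligible(const tracked_stride *t, uint32_t required)
{
    assert(t != NULL);
    return t->stride != 0 && t->repeats >= required;
}

/* Address a predefined number of strides ahead of the latest access. */
static uint64_t tracked_stride_target(const tracked_stride *t,
                                      uint32_t strides_ahead)
{
    assert(t != NULL);
    return t->last_addr + (uint64_t)(t->stride * (int64_t)strides_ahead);
}
```

With a threshold of 2, the record in this sketch becomes eligible on the third access of a regular sequence, and a single irregular access resets the count.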

Operation of the system of FIG. 1 may be illustrated in the context of the following example in which the following loop of instructions:

    for (int i=0; i<1024; i++) {
        instruction1; // Eq. to Load R1←A[i]
        instruction2; // Eq. to Load R2←B[i]
        instruction3; // Eq. to Load R3←C[i]
        . . .
    }

is loop-unrolled by an optimizing compiler into the equivalent of the following loop of instructions:

    for (int i=0; i<1024; i+=4) {
        instruction1; // Eq. to Load R1←A[i]
        instruction2; // Eq. to Load R2←B[i]
        instruction3; // Eq. to Load R3←C[i]
        . . . // 8 instructions that are not prefetching candidates
        instruction1; // Eq. to Load R1←A[i+1]
        instruction2; // Eq. to Load R2←B[i+1]
        instruction3; // Eq. to Load R3←C[i+1]
        . . . // 8 instructions that are not prefetching candidates
        instruction1; // Eq. to Load R1←A[i+2]
        instruction2; // Eq. to Load R2←B[i+2]
        instruction3; // Eq. to Load R3←C[i+2]
        . . . // 8 instructions that are not prefetching candidates
        instruction1; // Eq. to Load R1←A[i+3]
        instruction2; // Eq. to Load R2←B[i+3]
        instruction3; // Eq. to Load R3←C[i+3]
        . . . // 8 instructions that are not prefetching candidates
    }

where A, B, and C denote different data arrays, and where R1, R2, and R3 denote different registers of computer processor 100.

The optimizing compiler configures the first set of instructions 116 to include the following single instruction:

    Instruction Address    Instruction
    X−4                    Loop-Unrolled, 4, 11, 44

indicating that the second set of instructions 118 immediately following the first set of instructions 116 includes the following:

a) 4 iteration groups 120;

b) 11 loop-unrolled instructions in each iteration group 120; and

c) 44 instruction bytes in each iteration group 120.

The optimizing compiler configures the second set of instructions 118 to include the following loop-unrolled instructions:

    Instruction Address    Instruction
    X                      Load R1←A[i]
    X+4                    Load R2←B[i]
    X+8                    Load R3←C[i]
    . . .                  // 8 instructions that are not prefetching candidates
    X+44                   Load R1←A[i+1]
    X+48                   Load R2←B[i+1]
    X+52                   Load R3←C[i+1]
    . . .                  // 8 instructions that are not prefetching candidates
    X+88                   Load R1←A[i+2]
    X+92                   Load R2←B[i+2]
    X+96                   Load R3←C[i+2]
    . . .                  // 8 instructions that are not prefetching candidates
    X+132                  Load R1←A[i+3]
    X+136                  Load R2←B[i+3]
    X+140                  Load R3←C[i+3]
    . . .                  // 8 instructions that are not prefetching candidates
    X+176                  i=i+4
    X+178                  Branch

The four iteration groups 120 thus form a sequence within the second set of instructions 118 as follows:

    Sequence Index No.    Instruction Addresses
    0                     X through X+40
    1                     X+44 through X+84
    2                     X+88 through X+128
    3                     X+132 through X+172
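The sequence index numbers in the table above follow from the address of the first instruction of the second set of instructions 118 and the per-group byte count. How a given implementation tracks the index is not prescribed herein; the following C sketch shows one possible derivation, with a hypothetical function name and parameters.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the sequence index of the iteration group that contains the
 * instruction at address ia, or -1 if ia lies outside the groups.
 * base is the address of the first instruction of the first group. */
static int iteration_group_index(uint64_t ia, uint64_t base,
                                 uint32_t byte_count, uint32_t group_count)
{
    assert(byte_count > 0);
    if (ia < base) {
        return -1;
    }
    uint64_t index = (ia - base) / byte_count;
    return index < group_count ? (int)index : -1;
}
```

With a byte count of 44 and 4 groups, addresses X through X+40 yield index 0, X+44 through X+84 yield index 1, and so on, while X+176 (the loop increment) falls outside the groups.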

In this example, history table 112 includes 16 stride-tracking records. Prefetcher 106 maps each prefetching candidate instruction in each of the iteration groups 120 in the second set of instructions 118 to a stride-tracking record by applying the following predefined mapping function to the instruction address (IA) of the instruction being mapped, the sequence index number (IndexNo) of the iteration group 120 to which the instruction being mapped belongs, and the count of the instruction bytes (ByteCount) within each iteration group 120 as follows:

    Stride-tracking record no. = Value of 4 lowest bits of (IA − (IndexNo * ByteCount))
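The mapping function above may be sketched in C as follows. The function names are hypothetical. The second function, which indexes the history table by the 4 lowest bits of the unadjusted instruction address, is an assumed conventional scheme included only for comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Record number per the mapping function above:
 * the 4 lowest bits of (IA - (IndexNo * ByteCount)). */
static unsigned shared_record_no(uint64_t ia, uint32_t index_no,
                                 uint32_t byte_count)
{
    uint64_t offset = (uint64_t)index_no * byte_count;
    assert(ia >= offset);
    return (unsigned)((ia - offset) & 0xFu);
}

/* Assumed conventional indexing for comparison: the 4 lowest bits of the
 * instruction address, with no adjustment for the iteration group. */
static unsigned conventional_record_no(uint64_t ia)
{
    return (unsigned)(ia & 0xFu);
}
```

Taking X with its 4 lowest bits equal to 0000, the instructions at X, X+44, X+88, and X+132 all yield record 0 under shared_record_no, whereas conventional_record_no scatters them across records 0, 12, 8, and 4, with X+88 landing on the same record as X+8.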

Thus, assuming that the value of the 4 lowest bits of instruction address X is 0000, the prefetching candidate instructions in the second set of instructions 118 will be mapped to stride-tracking records as follows:

    Instruction Address    Stride-tracking record no.
    X                      0000
    X+4                    0100
    X+8                    1000
    X+44                   0000
    X+48                   0100
    X+52                   1000
    X+88                   0000
    X+92                   0100
    X+96                   1000
    X+132                  0000
    X+136                  0100
    X+140                  1000

Thus, the first prefetching candidate instruction in each of the iteration groups 120 is mapped to the same stride-tracking record no. 0000, the second prefetching candidate instruction in each of the iteration groups 120 is mapped to the same stride-tracking record no. 0100, and the third prefetching candidate instruction in each of the iteration groups 120 is mapped to the same stride-tracking record no. 1000. Prefetcher 106 records in each given stride-tracking record the stride between the addresses of the accessed data locations for each of the corresponding prefetching candidate instructions in each of the iteration groups 120 that are mapped to the same given stride-tracking record. Prefetcher 106 then prefetches data in accordance with conventional strided prefetching techniques based on any of the stride-tracking records that meets predefined eligibility criteria for prefetching.
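The effect of sharing a record among corresponding instructions may be sketched as a small simulation. The record layout, the names, and the 4-byte array element size are assumptions made for this sketch. The record selection follows the mapping function of the example, and stride verification follows the three-access rule described in the Background.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RECORD_COUNT 16u  /* history table size used in the example */

/* Hypothetical shared stride-tracking record. */
typedef struct {
    uint64_t last_addr;  /* most recent data address mapped to the record */
    int64_t  stride;     /* most recently observed stride, in bytes */
    unsigned seen;       /* accesses seen so far, saturating at 2 */
    bool     verified;   /* set once the stride has repeated */
} shared_record;

/* Select the record: 4 lowest bits of (IA - (IndexNo * ByteCount)). */
static unsigned select_record(uint64_t ia, unsigned index_no,
                              unsigned byte_count)
{
    return (unsigned)((ia - (uint64_t)index_no * byte_count) &
                      (RECORD_COUNT - 1u));
}

/* Store stride information for one prefetching candidate instruction in
 * its shared record; returns true once that record's stride is verified. */
static bool shared_observe(shared_record *table, uint64_t ia,
                           unsigned index_no, unsigned byte_count,
                           uint64_t data_addr)
{
    assert(table != NULL);
    shared_record *r = &table[select_record(ia, index_no, byte_count)];
    if (r->seen >= 1) {
        int64_t observed = (int64_t)(data_addr - r->last_addr);
        r->verified = (r->seen >= 2) && (observed == r->stride);
        r->stride = observed;
    }
    r->last_addr = data_addr;
    if (r->seen < 2) {
        r->seen++;
    }
    return r->verified;
}
```

Under these assumptions, the loads of A[i], A[i+1], and A[i+2] at X, X+44, and X+88 verify a stride of one element in record 0 within a single pass through the unrolled loop body. When the loop branches back, the load of A[i+4] at X continues the same stride after A[i+3], so the shared record remains consistent across passes.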

Reference is now made to FIG. 2, which is a simplified flowchart illustration of an exemplary method of operation of the system of FIG. 1, operative in accordance with an embodiment of the invention. In the method of FIG. 2, instructions are monitored as they are processed by a computer processor (step 200). A first set of instructions is identified that indicates that a second set of instructions that is about to be processed includes multiple iteration groups of loop-unrolled instructions, where the first set of instructions provides a count of the iteration groups, a count of the loop-unrolled instructions per iteration group, and a count of instruction bytes per iteration group (step 202). As each loop-unrolled instruction of an iteration group is processed, if the loop-unrolled instruction is a prefetching candidate instruction (step 204), a predefined mapping function is applied to a combination of the instruction address of the loop-unrolled instruction, an index number indicating the position of its iteration group within a sequence of the iteration groups, and the instruction byte count (step 206) to identify a corresponding stride-tracking record that is shared by a corresponding loop-unrolled instruction in each of the iteration groups (step 208). Stride information associated with corresponding loop-unrolled instructions in each of the iteration groups is stored in their shared stride-tracking record (step 210). Data are prefetched in accordance with conventional strided prefetching techniques based on any of the stride-tracking records that meets predefined eligibility criteria for prefetching (step 212).

It is to be appreciated that the term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other processing circuitry. It is also to be understood that the term “processor” may refer to more than one processing device and that various elements associated with a processing device may be shared by other processing devices.

The term “memory” as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, a fixed memory device (e.g., hard drive), a removable memory device (e.g., diskette), flash memory, etc. Such memory may be considered a computer readable storage medium.

In addition, the phrase “input/output devices” or “I/O devices” as used herein is intended to include, for example, one or more input devices (e.g., keyboard, mouse, scanner, etc.) for entering data to the processing unit, and/or one or more output devices (e.g., speaker, display, printer, etc.) for presenting results associated with the processing unit.

Embodiments of the invention may include a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the invention.

Aspects of the invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

The descriptions of the various embodiments of the invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

What is claimed is:
 1. A method for prefetching data, the method comprising: determining that a first set of instructions that is processed by a computer processor indicates that a second set of instructions includes a plurality of iteration groups, wherein each of the iteration groups includes one or more loop-unrolled instructions; monitoring the second set of instructions as the second set of instructions is processed by the computer processor after the first set of instructions is processed by the computer processor; mapping a corresponding one of the loop-unrolled instructions in each of the iteration groups of the second set of instructions to a stride-tracking record that is shared by the corresponding loop-unrolled instructions; and prefetching data into a cache memory of the computer processor based on the stride-tracking record, wherein the second set of instructions immediately follows the first set of instructions in a parent set of instructions that includes the first set of instructions and the second set of instructions.
 2. The method of claim 1 wherein the prefetching is performed if the stride-tracking record meets a predefined criterion.
 3. The method of claim 1 wherein the mapping is performed if the corresponding loop-unrolled instructions are prefetching candidate instructions.
 4. The method of claim 1 wherein the loop-unrolled instructions are associated with a loop of instructions, and wherein each of the iteration groups corresponds to a different iteration of the loop of instructions.
 5. The method of claim 1 wherein the first set of instructions indicates: a) a count of the iteration groups; b) a count of the loop-unrolled instructions in any of the iteration groups, wherein the count of the loop-unrolled instructions is the same for each of the iteration groups; and c) a count of instruction bytes in any of the iteration groups, wherein the count of instruction bytes is the same for each of the iteration groups.
 6. The method of claim 1 wherein each of the corresponding loop-unrolled instructions has a different instruction address.
 7. The method of claim 1 wherein the iteration groups form a sequence within the second set of instructions, and wherein the mapping comprises applying a mapping function to a combination of: a) an instruction address of the corresponding loop-unrolled instruction being mapped; b) an index number, within the sequence, of the iteration group of the corresponding loop-unrolled instruction being mapped; and c) a count of instruction bytes within the iteration group of the corresponding loop-unrolled instruction being mapped.
 8. A system for prefetching data, the system comprising: a computer processor; and a prefetcher assembled with the computer processor, wherein the prefetcher is configured to: determine that a first set of instructions that is processed by the computer processor indicates that a second set of instructions includes a plurality of iteration groups, wherein each of the iteration groups includes one or more loop-unrolled instructions; monitor the second set of instructions as the second set of instructions is processed by the computer processor after the first set of instructions is processed by the computer processor; map a corresponding one of the loop-unrolled instructions in each of the iteration groups of the second set of instructions to a stride-tracking record that is shared by the corresponding loop-unrolled instructions; and prefetch data into a cache memory of the computer processor based on the stride-tracking record, wherein the second set of instructions immediately follows the first set of instructions in a parent set of instructions that includes the first set of instructions and the second set of instructions.
 9. The system of claim 8 wherein the prefetcher is configured to prefetch the data if the stride-tracking record meets a predefined criterion.
 10. The system of claim 8 wherein the loop-unrolled instructions are prefetching candidate instructions.
 11. The system of claim 8 wherein the loop-unrolled instructions are associated with a loop of instructions, and wherein each of the iteration groups corresponds to a different iteration of the loop of instructions.
 12. The system of claim 8 wherein the first set of instructions indicates: a) a count of the iteration groups; b) a count of the loop-unrolled instructions in any of the iteration groups, wherein the count of the loop-unrolled instructions is the same for each of the iteration groups; and c) a count of instruction bytes in any of the iteration groups, wherein the count of instruction bytes is the same for each of the iteration groups.
 13. The system of claim 8 wherein each of the corresponding loop-unrolled instructions has a different instruction address.
 14. The system of claim 8 wherein the iteration groups form a sequence within the second set of instructions, and wherein the prefetcher is configured to map by applying a mapping function to a combination of: a) an instruction address of the corresponding loop-unrolled instruction being mapped; b) an index number, within the sequence, of the iteration group of the corresponding loop-unrolled instruction being mapped; and c) a count of instruction bytes within the iteration group of the corresponding loop-unrolled instruction being mapped.